robots.txt
robots.txt is a file used to inform web crawlers what parts of a site should or should not be crawled.
Because this file is only advisory and bots can choose to ignore it, it is not a guaranteed way of keeping crawlers away. But crawlers from the big search engines will generally respect it and won't publicly index your site if you declare it so.
Example directive names:
- User-agent: which crawler(s) the following rules apply to
- Disallow: a path prefix that should not be crawled
- Allow: a path prefix that may be crawled, overriding a broader Disallow
- Sitemap: the URL of your sitemap
- Crawl-delay: seconds to wait between requests (nonstandard; ignored by Google)
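A minimal sketch putting these directives together (the paths and sitemap URL below are placeholders, not recommendations):

User-agent: *                              # applies to every crawler
Disallow: /private/                        # keep crawlers out of this directory
Allow: /private/overview.html              # but allow this one page inside it
Sitemap: https://example.com/sitemap.xml   # placeholder sitemap URL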
Examples
The following examples may be copied and pasted into a plain text robots.txt file and placed at the root of your domain (e.g. https://example.com/robots.txt).
Brief example to block anything inside a particular top-level directory "/wiki/":
User-agent: *
Disallow: /wiki/
Note that Google seems to ignore the "*" User-agent and must be specifically disallowed (a crawler follows only the group that most specifically matches its user agent, so a Googlebot group takes precedence over "*"):
User-agent: Googlebot
Disallow: /wiki/
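Both rules can live in one file. Since a crawler follows only the group that best matches its user agent, a combined sketch looks like this:

# Every other crawler
User-agent: *
Disallow: /wiki/

# Googlebot ignores the "*" group once it has a group of its own
User-agent: Googlebot
Disallow: /wiki/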
You may want to entirely block some particularly abusive bots:
User-agent: AhrefsBot
Disallow: /
Directives to disallow GPTBot: https://platform.openai.com/docs/gptbot/disallowing-gptbot
User-agent: GPTBot
Disallow: /
Directive to disallow ChatGPT: https://platform.openai.com/docs/plugins/bot
User-agent: ChatGPT-User
Disallow: /
Directive to disallow use of your content for Google Bard and Vertex AI generative APIs [1]:
User-agent: Google-Extended
Disallow: /
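Putting the three AI crawler groups above together, one file can cover them all:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /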
See Also
- robots
- https://www.robotstxt.org/robotstxt.html
- LOL: https://web.archive.org/web/20140702214604/https://www.google.com/killer-robots.txt
- Google crawler's implementation of robots.txt: https://developers.google.com/search/docs/advanced/robots/robots_txt
- Google's C++ robots.txt parser: https://github.com/google/robotstxt
- for fun: https://www.last.fm/robots.txt with details at https://www.wired.com/2010/08/robot-laws/
- Go ahead and block AI web crawlers